Some Unsupervised Learning Models

Unsupervised learning is the study of data without labels; typically, it is the task of grouping, explaining, and finding structure in the data.

Clustering

Clustering is the task of organizing data into groups (clusters), and the K-means algorithm is a common way to do it. When the data show multiple distinct modes, we say the distribution is multimodal; the process of grouping those modes into clusters without labels is clustering.

The K-means algorithm partitions the data into $K$ distinct, non-overlapping clusters. Let $C_1, \ldots, C_K$ be the clusters, whose union is the whole data set; since they are non-overlapping, $C_i \cap C_j = \emptyset$ for all $i \neq j$. The key idea is that a good clustering keeps the within-cluster variation as small as possible. We define the within-cluster variation $W(C_k)$ as a measure of how much the observations within cluster $C_k$ differ from one another, and we want to find $C_1, \ldots, C_K$ that minimize $\sum_{k=1}^K W(C_k)$.

  • It is common to use squared Euclidean distance as the measure of difference: $W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2$, where $|C_k|$ is the number of observations in cluster $C_k$.
    • Let $\bar{x}_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$ be the centroid of cluster $C_k$; then $W(C_k) = 2 \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 = 2 \sum_{i \in C_k} \|x_i - \bar{x}_k\|_2^2$, so minimizing $W(C_k)$ means keeping each cluster's points close to its centroid (a numeric check of this identity follows the list).
  • There are $K^n$ ways to assign $n$ observations to $K$ clusters, so exhaustively searching for the best partition is computationally infeasible; K-means instead finds a good local optimum iteratively.
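
A quick numeric check of the centroid identity above, as a minimal NumPy sketch on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 observations in one cluster C_k, p = 3

# Pairwise form: (1/|C_k|) * sum over i, i' of ||x_i - x_i'||^2
diffs = X[:, None, :] - X[None, :, :]      # shape (8, 8, 3)
w_pairwise = (diffs ** 2).sum() / len(X)

# Centroid form: 2 * sum over i of ||x_i - x_bar||^2
w_centroid = 2 * ((X - X.mean(axis=0)) ** 2).sum()

assert np.isclose(w_pairwise, w_centroid)
```

The centroid form is the useful one in practice: it lets K-means work with cluster means instead of all pairwise distances.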

Steps:

  1. Initialization: randomly initialize the $K$ cluster centers, e.g., at $K$ randomly chosen observations.
  2. Then iterate, alternating between two steps, until the assignments stop changing (a minimal implementation sketch follows this list):
    • Assignment step: assign each observation to the closest cluster center.
      • This step can only decrease the total within-cluster variation.
    • Re-center step: move each cluster center to the mean of the data assigned to it.
      • This step can only decrease the total within-cluster variation.
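
A minimal sketch of the algorithm in NumPy (the function name and the choice to initialize centers at $K$ random observations are illustrative assumptions, not the only option):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate the two steps until assignments stabilize."""
    rng = np.random.default_rng(seed)
    # Initialization: use K randomly chosen observations as the centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation goes to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Re-center step: move each center to the mean of its points.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```

Since each step can only decrease the total within-cluster variation, the loop terminates at a local optimum; in practice one restarts from several random initializations and keeps the best result.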

Some common extensions of the K-means algorithm:

  • Non-exhaustive clustering: allow some data points not to belong to any cluster.
  • Overlapping clustering: allow some data points to belong to more than one cluster (sometimes called soft or fuzzy K-means); see the sketch below.
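
For the overlapping extension, one standard relaxation replaces hard assignments with membership weights. A sketch, assuming a softmax over negative squared distances with an illustrative sharpness parameter `beta`:

```python
import numpy as np

def soft_assign(X, centers, beta=1.0):
    """Membership weights: each row sums to 1, so a point can belong
    to several clusters at once; larger beta gives harder assignments."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-beta * dists)
    return w / w.sum(axis=1, keepdims=True)
```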

Principal Component Analysis

Principal Component Analysis (PCA) is used for dimensionality reduction: it maps the data to a lower-dimensional space. The idea of PCA is to find a linear low-dimensional representation of the data that preserves as much variation as possible.

We define the $k$th principal component (PC) as $Z_k = u_{1k}X_1 + \cdots + u_{pk}X_p$, where the coefficients $u_{1k}, \ldots, u_{pk}$ satisfy the following constraints:

  • $Z_k$ has the largest possible variance.
  • $U_k^T = (u_{1k}, \ldots, u_{pk})$ should be normalized, $\|U_k\|_2^2 = \sum_{j=1}^p u_{jk}^2 = 1$, and we call it the loading vector. We select $U_k$ by $\arg\max\limits_{u} \frac{1}{n} \sum_{i=1}^n \big(\sum_{j=1}^p u_{jk} x_{ij}\big)^2$, or in matrix form $\arg\max\limits_{u} \frac{1}{n} u^T X^T X u$ (for centered $X$) subject to $\|u\|_2 = 1$. Moreover, $U_k$ is the $k$th eigenvector (ordered by decreasing eigenvalue) of the sample covariance matrix $\hat{\Sigma}$; a numeric check follows this list.
  • Each $Z_k$ is uncorrelated with $Z_{k'}$ for $k \neq k'$.
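
These properties can be verified numerically: the loadings come out as eigenvectors of $\hat{\Sigma}$, and the resulting PCs are uncorrelated with variances equal to the eigenvalues. A minimal NumPy sketch on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # center the columns

Sigma_hat = Xc.T @ Xc / len(X)           # sample covariance (1/n convention)
eigvals, U = np.linalg.eigh(Sigma_hat)
U, eigvals = U[:, ::-1], eigvals[::-1]   # sort by decreasing variance

Z = Xc @ U                               # all p principal components
cov_Z = Z.T @ Z / len(X)
# Off-diagonals vanish (PCs are uncorrelated) and the k-th PC's
# variance equals the k-th eigenvalue of Sigma_hat.
assert np.allclose(cov_Z, np.diag(eigvals))
```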

Given a data matrix $X \in \mathbb{R}^{n \times p}$, we can construct $K$ PCs, with $K \in \{1, \ldots, p\}$, using the following steps:

  1. Center $X$ so that the columns have zero mean: $\tilde{X} = X - 1_n\bar{X}$, where $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is the $1 \times p$ row vector of column means and $1_n$ is the $n \times 1$ vector of ones, so that $1_n\bar{X}$ is an $n \times p$ matrix.
  2. Compute the first $K$ loading vectors $U_K = (U_1, \ldots, U_K) \in \mathbb{R}^{p \times K}$ from the centered data $\tilde{X}$.
  3. Obtain the first $K$ centered PCs: $\tilde{Z}_K = \tilde{X}U_K$.
  4. Add the centers back to the PCs: $Z = \tilde{Z}_K + 1_n\bar{X}U_K = XU_K$.
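
Putting the four steps together (a minimal NumPy sketch; `pca_scores` is an illustrative name). The final assertion confirms that adding the centers back collapses the whole pipeline into the single projection $Z = XU_K$:

```python
import numpy as np

def pca_scores(X, K):
    """Steps 1-4: center, compute top-K loadings, project, add centers back."""
    n = len(X)
    X_bar = X.mean(axis=0, keepdims=True)   # 1 x p row of column means
    X_tilde = X - X_bar                     # step 1 (broadcasts 1_n @ X_bar)
    Sigma_hat = X_tilde.T @ X_tilde / n
    _, U = np.linalg.eigh(Sigma_hat)
    U_K = U[:, ::-1][:, :K]                 # step 2: first K loading vectors
    Z_tilde = X_tilde @ U_K                 # step 3: centered PCs
    Z = Z_tilde + X_bar @ U_K               # step 4: add the centers back
    return Z, U_K

X = np.random.default_rng(1).normal(size=(50, 5))
Z, U_K = pca_scores(X, K=2)
assert np.allclose(Z, X @ U_K)
```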